Introducing FugueSQL — SQL for 
Pandas, Spark, and Dask DataFrames 


As a data scientist, you might be familiar with both Pandas and SQL. 
However, there might be some queries, transformations that you feel 


comfortable doing in SQL instead of Python. 
Wouldn’t it be nice if you can query a pandas DataFrame like below: 
import pandas as pd 


df = pd.DataFrame({i"coli": [1, 2, 3, 4], “col2": "a"; 
ue cae ACS g ead E 


df 
col1 col2 

0 1 a 

1 2 b 

2 3 E 

K 4 E 


.. using SQL? 


Or use a Python function within a SQL query? 


# schema: x*, col3:str 
def str_concat(df: pd.DataFrame, delimeter: str) -> 
pd.DataFrame: 
df = df.assign( 
col3=df["col1i"].astype(str) + delimeter + df["col2"] 
) 


return df 


SF Sql 
SELECT * 
FROM df 


TRANSFORM USING str_concat(delimeter="_") 
PRINT 


That is when FugueSQL comes in handy. 


What is FugueSQL? 


FugueSQL is a Python library that allows users to combine Python code and 
SQL commands. This gives users the flexibility to switch between Python 
and SQL within a Jupyter Notebook or a Python script. 


To install FugueSQL, type: 


pip install fugue[sql] 


To run on Spark or Dask execution engines, type: 


pip install fugue[sql, spark] 
pip install fugue[sql, dask] 
pip install fugue[all] 


In this article, we will explore some utilities of FugueSQL and compare 


FugueSQL with other tools such as pandasal. 


FugueSQL in a Notebook 


FugueSQL comes with a Jupyter notebook extension that allows users to 


interactively query DataFrames with syntax highlighting. 


To use it, import the setup function from fugue_notebook to register 


the%%fsql cell magic. 


from fugue_notebook import setup 
setup() 


To understand how the %%fsql cell magic, lers start with creating a pandas 


DataFrame: 
import pandas as pd 


df = pd.DataFrame({"coli": [1; 2, 3, 4], "coL": ["a’, 
Oey ae Se a eal R 


df 
coll col2 
0 1 a 
1 2 b 
2 3 C 
K 4 E 


Now, you can query like how you would normally do in SQL by adding the 
%%fsql at the beginning of the cell. 


$%é£sql 
SELECT * 
FROM df 
WHERE col2="c" 
PRINT 
coll col2 
0 3 e 
1 4 c 


schema: col1:long,col2:str 


In the code above, only PRINT does not follow standard SQL. This is similar 
to the pandas head() and Spark show() operations to display a number of 


rows. 


Operations such as GROUP BY are similar to standard SQL syntax. 


$s£sql 


SELECT col2, AVG(coll) AS avg coll 
FROM df 

GROUP BY col2 

PRINT 


col2 avg_col1 


o a 1.0 
1 b 2.0 
2 Cc 3.5 


schema: col2:str,avg_col1:double 


An Enhanced SQL Interface 


For SQL users, nothing shown above is out of the ordinary except for the 
PRINT statement. However, Fugue also adds some enhancements to 
standard SQL, allowing it to handle end-to-end data workflows gracefully. 


Deal With Temp Tables 


SQL users often have to use temp tables or common table expressions (CTE) 
to hold intermediate transformations. Luckily, FugueSQL supports the 


creation of intermediate tables through a variable assignment. 


For example, after transforming df , we can assign it to another variable 
called df2 and save df2 to a file using SAVE variable OVERWRITE file_name 


$s£sql 
df2 = SELECT * 
FROM df 
WHERE col2="c" 


SAVE df2 OVERWRITE '/tmp/df2.csv' (header=true) 


Now, if we want to apply more transformation to df2 , simply load it from 


the file we saved previously. 


$%fsql 
df3 = LOAD '/tmp/df2.csv' (header=true) 
SELECT * 


FROM df3 
PRINT 


col1 col2 
0 3 C 
1 4 6 


schema: col1:str,col2:str 


Pretty cool, isn’t it? 


Added Keywords 


SOL S grammar is meant for querying, which means that it lacks keywords 
to manipulate data. FugueSQL adds some keywords for common 


DataFrame operations. For example: 


s DROP 


$%fsql 


df4 = DROP COLUMNS col2 IF EXISTS FROM df 
PRINT df4 


colt 


0 1 
2 
2 3 
4 


schema: col1:long 


s FILL NULLS PARAMS 


import numpy as np 


null_df = pd.DataFrame( 
{"coli": [np.nan, np.nan, 1], 
"col?" [2; 3, npsnanls 


$s£sql 

-- Fill nan at coll with 1 and nan at col2 with 2 
df1 = FILL NULLS PARAMS coll:1, col2:2 FROM null df 
PRINT dfl 


Col col2 
O 1.0 2.0 
119520 
2 1.0 2.0 


schema: col1:double,col2:double 


s SAMPLE 


$sf£sql 

df2 = SAMPLE 2 ROWS SEED 42 FROM df 
PRINT df2 

df3 = SAMPLE 50 PERCENT SEED 1 FROM df 
PRINT df3 


col1 col2 
0 2 b 
1 4 G 


schema: col1:long,col2:str 


coll col2 
0 4 C 
1 3 G 


schema: col1:long,col2:str 


For a full list of operators, check the FugueSQL operator docs. 


Integrate With Python 


FugueSQL also allows you to use Python functions within a SQL query 
using TRANSFORM . 


For example, to use the function str_concat in a SQL query: 


def str_concat(df, delimeter): 
df = df.assign( 
col3=df["col1l"].astype(str) + delimeter + df["col2"] 
) 


return df 


... simply add the following components to the function: 


= Output schema hint (as a comment) 


s Type annotations (pd.DataFrame ) 


# schema: *,; col3istr 
def str_concat(df: pd.DataFrame, delimeter: str) -> 
pd.DataFrame: 
df = df.assign( 
col3=df["col1l"].astype(str) + delimeter + df["col2"] 
) 


return df 


Cool! Now we are ready to add it to a SQL query: 


$%f£sql 

SELECT * 

FROM df 

TRANSFORM USING str_concat(delimeter="_") 
PRINT 


coll col2 col3 

0 1 a 1a 
Zz b 2_b 

2 3 c 3C 
4 C 4&6 


schema: col1:long,col2:str,col3:str 


Scaling to Big Data 


FugueSQL Spark 


One of the beautiful properties of SQL is that it is agnostic to the size of the 
data. The logic is expressed in a scale-agnostic manner and will remain the 


same even if running on Pandas, Spark, or Dask. 


With FugueSQL, we can apply the same logic on the Spark execution engine 
just by specifying %%fsql spark . We don’t even need to edit the 


str_concatfunction to bring it to Spark as Fugue takes care of porting it. 


$%£sql spark 

SELECT * 

FROM df 

TRANSFORM USING str_concat(delimeter="_") 
PRINT 


col1 col2 col3 
0 1 a 1a 
D 2p 


2 
2 3 ù 2C 
4 


C46 


schema: col1:long,col2:str,col3:str 


PREPARTITION BY 


One of the important parts of distributed computing is partitioning. For 
example, to get the median value in each logical group, the data needs to be 


partitioned such that each logical group lives on the same worker. 


To describe this, FugueSQL has the PREPARTITION BY keyword. Fugue’s 
prepartition-transform semantics are equivalent to the pandas groupby- 
apply . The only difference is that prepartition-transformscales to the 
distributed setting as it dictates the location of the data. 


# schema: x 
def get_median(df: pd.DataFrame) -> List[DictI[str, Any]]: 
return [ 
{ 
"coli": df["col1"].median(), 
nooli: df["col2”|.1loclG] + 


$sf£sql spark 

SELECT * 

FROM df 

TRANSFORM PREPARTITION BY col2 USING get_median 
PRINT 


coll col2 
0 1 a 
1 2 b 
2 3 C 


schema: col1:long,col2:str 


Note that the get_median function above gets called once for each distinct 
value in the column col2 . Because the data is partitioned beforehand, we 
can just pull the first value of col2 to know what group we are working 
with. 


FugueSQL in Production 


To bring FugueSQL out of Jupyter notebooks and into Python scripts, all we 
need to do is wrap the FugueSQL query inside a fsql class. We can then call 
the .run() method and choose an execution engine. 


from fugue_sql import fsql 
import fugue_spark 


fsql( 
loi SELECT- & 
FROM df 
TRANSFORM PREPARTITION BY col2 USING get_median 
PRINT!" 
}. run("spark™) 


coll col2 
o 1 a 
1 2 b 


What Is the Difference Between 
FugueSQL and pandasql? 


If you know pandasql, you might wonder: Why should you use FugueSQL 
if pandasq] already allows you to run SQL with pandas? 


pandasq] has a single backend, SQLite. It introduces a large overhead to 
transfer data between pandas and SQLite. On the other hand, FugueSQL 
supports multiple local backends: pandas, DuckDB and SQLite. 


When using the pandas backend, Fugue directly translates SQL to pandas 
operations, so there is no data transfer at all. DuckDB has superb pandas 
support, so the overhead of data transfer is also negligible. Both Pandas and 


DuckDB are preferred FugueSQL backends for local data processing. 


Fugue also has support for Spark, Dask, and cuDF (through blazingSQL) as 
backends. 


Conclusion 


Congratulations! You have just learned how to use FugueSQL as a SQL 
interface for operating on Python DataFrames. With FugueSQL, you can 
now use SQL syntax to express end-to-end data workflows and scale to 


distributed computing seamlessly! 


This article does not exhaustively cover FugueSQL features. For more 


information about Fugue or FugueSQL, check the resources below. 


s Github Repo 
s FugueSQL Documentation 


s Fugue Slack 


Feel free to play and fork the source code of this article here. 


